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Velocity  and  Structure  Estimation  of  a  Moving  Object  Using  a  Moving 

Monocular  Camera 


V.  K.  Chitrakaran',  D.  M.  Dawson,  J.  Chen,  and  H.  Kannan 


Abstract — In  this  paper,  we  present  the  development  of 
a  vision-based  estimator  for  simultaneous  determination  of 
velocity  and  structure  of  an  object  (i.e.,  the  Euclidean  position  of 
its  feature  points)  for  the  general  case  where  both  the  object  and 
the  camera  are  moving  relative  to  an  inertial  frame  of  reference. 
The  velocity  estimation  itself  requires  no  explicit  kinematic 
model,  while  the  adaptive  structure  estimator,  synthesized 
utilizing  Lyapunov  design  methods,  is  built  upon  kinematic 
relationships  that  rely  on  homography-based  techniques. 

I.  Introduction 

The  application  of  a  camera  as  a  sensor  for  acquiring 
the  3D  structure  of  a  scene  is  known  as  “Structure  from 
Motion  (SfM)”  in  the  computer  vision  literature  [17].  The 
problem  usually  involves  a  camera  mounted  on  a  moving 
platform  such  as  a  mobile  robot  whose  image  data  is 
utilized  to  map  the  Euclidean  position  of  static  landmarks 
or  visual  features  in  the  environment.  Recent  applications 
of  this  technique  include  aerial  surveillance  and  mapping 
systems  such  as  the  work  presented  in  [10]  and  [12].  A 
significant  extension  to  the  above  problem  is  the  case  where 
both  the  object  of  interest  and  the  camera  are  in  motion, 
with  potential  impact  on  such  applications  as  vision-based 
collision  avoidance  systems  for  autonomous  guidance  of 
multiple  vehicles  on  highways  [9].  In  this  paper,  we  propose 
a  nonlinear  estimation  strategy  to  identify  the  Euclidean 
structure  and  velocity  of  a  moving  object  using  a  monocular 
calibrated  moving  camera.  The  proposed  algorithm  relies  on 
the  availability  of  a  reference  image  of  the  target  object, 
captured  at  a  known  orientation  relative  to  the  camera.  For 
the  general  case  where  both  the  camera  and  the  object  are 
in  motion  relative  to  an  inertial  frame,  a  single  geometric 
length  on  the  object  is  assumed  to  be  known,  and  sensors 
mounted  on  the  camera  are  assumed  to  provide  camera 
velocity  information.  Typically,  linearization  based  methods 
such  as  extended  Kalman  filtering  [1],  [5],  [17]  provide 
the  underlying  algorithmic  foundation  for  most  SfM  results, 
although  the  problem  of  estimating  3D  information  from  2D 
images  is  inherently  nonlinear.  In  this  work,  we  approach  the 
problem  using  nonlinear  system  analysis  tools.  Equations  for 
motion  kinematics  are  first  developed  in  terms  of  Euclid¬ 
ean  and  image-space  information  based  on  the  approach 
presented  in  [16].  An  integral  feedback  estimation  method 
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introduced  in  [4]  is  then  employed  to  identify  the  linear 
and  angular  velocity  of  the  moving  object.  The  estimated 
velocities  facilitate  the  development  of  a  measurable  error 
system  that  can  be  used  to  formulate  a  nonlinear  least  squares 
adaptive  update  law  for  Euclidean  structure  estimation.  Then, 
the  satisfaction  of  a  persistent  excitation  condition  (similar 
to  [2]  and  others)  allows  the  determination  of  the  coordinates 
for  all  the  feature  points  on  the  object  relative  to  the  camera. 

Although  the  requirement  of  velocity  sensors  on  the  cam¬ 
era  might  seem  too  restrictive  at  first,  the  proposed  algorithm 
does  allow  for  some  interesting  applications.  For  example,  a 
camera  on-board  an  aerial  reconnaissance  vehicle  could  be 
employed  to  track  and  estimate  the  dimensions  of  ground 
targets  (such  as  moving  vehicles)  utilizing  a  single  snapshot 
as  the  reference  image  captured,  for  example,  by  a  satellite. 
In  this  case,  the  camera  velocity  information  can  be  obtained 
from  the  Inertial  Navigation  system  on-board  the  aircraft. 
Similarly,  a  pan-tilt  camera  could  be  employed  as  a  passive 
radar  for  pose  and  velocity  recovery  of  a  rigid  moving  object 
in  its  field  of  view.  The  fact  that  camera  motion  is  allowed 
frees  the  moving  object  from  any  restrictive  assumptions  on 
its  trajectory,  as  the  camera  orientation  can  be  continuously 
updated  in  order  to  keep  the  object  within  its  field  of  view. 
A  system  based  on  this  technique  could  also  be  applied  in 
the  microscopic  domain:  a  camera  equipped  with  magnifying 
lenses  mapping  structure  and  motion  of  objects  too  small  to 
have  motion  sensors  embedded  on  them. 

The  remainder  of  this  paper  is  organized  as  follows.  In 
Section  II,  we  begin  with  the  development  of  a  geometric 
model  and  homography  relating  the  pixel  coordinates  of 
visual  markers  on  the  object,  extracted  from  the  reference 
image  and  the  continuous  sequence  of  images  from  the 
camera.  Section  III  describes  the  motion  kinematics  between 
the  camera  and  the  target  object.  This  is  followed  by  the 
development  of  estimators  for  relative  velocity  and  Euclidean 
position  in  Sections  IV  and  V,  respectively.  A  Lyapunov 
stability  analysis  for  Euclidean  position  estimation  is  pro¬ 
vided  in  Section  V-A.  In  Appendix  II,  we  illustrate  the  cases 
where  either  the  object  or  the  camera  is  stationary  relative 
to  the  inertial  frame.  These  special  cases  add  to  the  range 
of  practical  applications  of  this  algorithm,  especially  since 
some  of  assumptions  necessary  for  the  general  case  can  be 
relaxed.  Specifically,  no  velocity  sensors  on  the  camera  are 
required  for  the  special  cases. 

II.  Geometric  Model 

We  begin  the  development  of  a  geometric  model  relating 
the  camera  and  the  object  by  introducing  some  notation. 


c 


Fig.  1.  The  geometry  of  the  camera  frame  and  the  object  frame  relative 
to  the  inertial  frame. 

In  Figure  1,  the  inertial,  camera  and  the  object  frames  are 
denoted  by  the  orthogonal  coordinate  frames  X,  C,  and  O, 
respectively.  For  the  sake  of  simplicity,  it  is  assumed  that 
the  origin  of  C  coincides  with  the  optical  center  of  the 
camera.  The  vector  xc(t)  £  R3  and  the  matrix  Rc(t)  £ 
SO( 3)  denote  the  position  and  orientation,  respectively,  of 
the  camera  relative  to  the  inertial  frame  X,  and  expressed  in 
the  coordinates  of  X,  such  that  Rc(t)  defines  the  mapping 
Rc  :  C  — >  X.  Similarly,  the  position  and  orientation  of  the 
object  frame  relative  to  the  inertial  frame  are  quantified  by 
x0(t)  £  R3  and  R0(t)  £  50(3),  respectively,  where  R0{t) 
is  the  mapping  R0  :  O  — >  X.  We  also  define  the  quantities 
xco(t)  £  R3  and  Rco(t )  £  50(3)  to  be  the  position  and 
orientation  of  the  object  frame  relative  to  the  camera  frame, 
and  expressed  in  the  camera  frame  C,  such  that  Rco  :  O  — >  C. 
The  three  dimensional  structure  of  a  rigid  object  in  the  field 
of  view  of  the  camera  is  described  in  terms  of  the  3D  location 
of  various  visual  features  on  the  object  relative  to  the  camera 
frame.  If  the  Euclidean  coordinates  of  the  ith  feature  point 
Ot  on  the  object  is  denoted  by  the  constant  s*  £  R3  in  the 
object  frame  O ,  and  rrtj(t)  =  [  fhiX  fhiy  fhiZ  ]  £  R3 

in  the  camera  frame  C,  then,  it  can  be  seen  from  Figure  1 
that, 

TTli  —  Xco  T  RcoSi.  (1) 

We  now  develop  a  geometric  relationship  that  describes 
how  the  image  coordinates  of  a  feature  point  on  the  object 
change  with  relative  motion  between  the  camera  and  the 
object.  To  this  end,  we  define  a  reference  position  of  the 
object  relative  to  the  camera,  denoted  by  O*  in  Figure  1. 
At  this  reference  location  relative  to  the  camera,  the  position 
and  orientation  of  O* ,  and  the  Euclidean  coordinates  of  the 
feature  points  relative  to  the  camera  frame,  are  denoted  by 
the  constant  terms  x*a  £  R3,  R*a  £  50(3)  and  to*  £  R3, 
respectively.  Hence,  similar  to  (1),  we  have 

m*  =  x*co  +  R*coSi.  (2) 


After  solving  (2)  for  s,  and  then  substituting  the  resulting 
expression  into  (1),  we  have 

fhi  =  x  +  Rfh*  (3) 

where  R  (t)  £  SO  (3)  and  x  (f)  £  R3  are  new  rotational  and 
translational  variables,  respectively,  defined  as  follows 

R  =  Rco  ( R*co)T  >  x  =  xco-  Rx*co  ■  (4) 

Let  three  of  the  non-collinear  feature  points  on  the  object, 
denoted  by  Oi  Mi  =  1,  2,  3,  define  the  plane  n  in  frame 
O,  and  7r*  in  O*.  As  illustrated  in  Figure  1,  n*  £  R3 
denotes  the  constant  normal  to  the  plane  n*  expressed  in  the 
coordinates  of  C,  and  the  constant  projections  of  fh *  along 
the  unit  normal  n*,  denoted  by  d*  £  R,  are  given  by 

d*  =  n*Tfh*.  (5) 

Using  (5),  it  can  be  easily  seen  that  the  relationship  in 
equation  (3)  can  now  be  expressed  as  follows 


where  H(t)  £  R3x3  denotes  a  Euclidean  homography  [14]. 
To  express  the  above  relationship  in  terms  of  the  measurable 
image  space  coordinates  of  the  visual  features  relative  to 
the  camera  frame,  the  normalized  Euclidean  coordinates 
TOj(f),  to*  £  R3  are  defined  as  follows 


The  image  coordinates  of  these  feature  points  expressed 
relative  to  C  are  denoted  by  Pi(t),p*  £  R3  as  follows 

Pi=[ui  Vi  1  )r,  P*=  [  u*  v*  1  ]T  •  (8) 

The  image  coordinates  and  the  normalized  Euclidean  coor¬ 
dinates  are  related  by  the  pin-hole  camera  model  [15]  such 
that 

Pi  =  Ami ,  Pi  =  Am*  (9) 

where  A  £  R3x3  is  a  known,  constant,  upper  triangular 
and  invertible  intrinsic  camera  calibration  matrix.  From  (6) 
and  (9),  the  relationship  between  image  coordinates  of  the 
corresponding  feature  points  in  O  and  O  *  can  be  expressed 
as  follows 

Pi  =  fr—  A  (R  +  xhi{n*)T)  A-1  p* 


where  a 

d* 

G(t)  £  I 


£  R  denotes  the  depth  ratio,  and  Xhi(t)  = 
denotes  the  scaled  translation  vector.  The  matrix 


G(t)  £  R3x3  defined  in  (10)  is  a  full  rank  homogeneous 
collineation  matrix  defined  up  to  a  scale  factor  [15].  If  the 
structure  of  the  object  is  planar,  all  feature  points  lie  on  the 
same  plane,  and  hence  the  distances  d*  defined  in  (5)  is  the 
same  for  all  feature  points,  henceforth,  denoted  as  d*.  In 


this  case,  the  collineation  G(t)  is  defined  up  to  the  same 
scale  factor,  and  hence,  one  of  its  elements  can  be  set  to 
unity  without  loss  of  generality.  Given  a  reference  image  of 
the  object  corresponding  to  O * ,  and  a  continuous  stream  of 
images  of  the  object  from  the  camera  corresponding  to  O 
at  every  instant  of  time,  G(t)  can  be  estimated  from  a  set 
of  linear  equations  (10)  obtained  from  at  least  four  matched 
feature  points  ( Pi{t),p *)  that  are  coplanar  but  non-collinear. 
If  the  structure  of  the  object  is  not  planar,  the  Virtual  Parallax 
method  described  in  [15]  could  be  utilized,  where  three  of 
the  non-collinear  feature  points  on  the  object  are  utilized  to 
define  a  virtual  plane  at  the  distance  d*  from  the  camera.. 
An  overview  of  the  determination  of  the  collineation  matrix 
G(t)  and  the  depth  ratios  aRt)  for  both  the  planar  and  non- 
planar  cases  are  also  given  in  [3].  Based  on  the  fact  that 
the  intrinsic  camera  calibration  A  is  known  apriori,  we  can 
then  determine  the  Euclidean  homography  H(t).  By  utilizing 
various  techniques  (see  algorithms  in  [8],  [14],  [21]),  H(t) 
can  be  decomposed  into  its  constituent  rotation  matrix  R(t), 

A  X(t) 

unit  normal  vector  n*,  scaled  translation  vector  Xh(t)  =  — — 
and  the  depth  ratio  aRt).  It  is  assumed  that  the  constant 
rotation  matrix  R*a  is  known.  f?co(f)  can  therefore  be 
computed  from  (4).  Hence  Rco(t ),  R(t),Xh(t)  and  aRt)  are 
known  signals  that  can  be  used  in  the  subsequent  analysis. 

Remark  1:  The  subsequent  development  requires  that  the 
constant  rotation  matrix  R*a  be  known.  We  consider  this  to 
be  a  mild  assumption  since  the  reference  image  of  the  object 
can  be  acquired  offline  after  placing  the  object  or  the  camera 
at  some  known  orientation  relative  to  each  other. 


III.  Kinematics 

Let  vc(t),uic(t )  £  R3  denote  the  translational  and  rota¬ 
tional  velocities  of  the  camera,  relative  to  X  and  expressed  in 
the  coordinates  of  X.  In  the  coordinates  of  C,  the  same  physi¬ 
cal  quantities  relative  to  X  are  denoted  as  vcc(t),LOcc(t)  £  R3. 
Similarly,  v0(t),u0(t)  €  R3  and  v00(t),u)00(t)  £  R3  denote 
the  translational  and  rotational  velocities  of  the  object  frame, 
all  relative  to  X,  expressed  in  the  coordinates  of  X  and  O , 
respectively.  These  velocities  are  related  to  each  other  in  the 
following  manner, 

KTc  “cc  }T  =  Rl[vTc  cJ]T,  (11) 

KTo  U&]T  =  RTo  KT  ]T ■  (12) 

To  quantify  the  translation  between  the  coordinate  frames 
O*  and  O,  we  define  ev(t)  £  R3  in  terms  of  the  image 
coordinates  of  one  of  the  feature  points  on  the  plane  n.  For 
notational  simplicity,  we  chose  0\  and  hence, 

ev=[u\—u\  vi—v\  —  ln(ai)]T.  (13) 

The  signal  ev(t)  is  measurable  since  the  first  two  elements  of 
the  vector  are  obtained  from  the  images  and  the  last  element 
is  available  from  known  signals  as  discussed  in  the  previous 
section.  After  taking  the  time  derivative  of  (13),  the  following 
translational  kinematics  can  be  obtained  (see  Appendix  I  for 


details) 

ev  —  _  *  Ae i  [S(xco)ucc  Rcovr  T  RcoS(^si^)ur^  (14) 

where  S(-)  £  R3x3  denotes  the  skew-symmetric  form  of 
the  vector  Si  as  defined  in  [19],  and  vr(t),u>r(t)  £  R3  are 
the  relative  translational  and  rotational  velocities  between  the 
camera  and  the  object  defined  in  the  following  manner 

Vr  =  Ro  (Vc  -  V0)  ,  Ur  =  Ro  0Jo)  .  (15) 


In  (14),  Aei(t)  £  R3x3  is  a  function  of  the  camera  intrinsic 
calibration  parameters  and  image  coordinates  of  the  ith 
feature  point  as  shown  below 


APi  =  A  — 


0  0  Mi 
0  0  Mi 
0  0  0 


(16) 


Similarly,  to  quantify  the  rotation  between  O*  and  O,  we 
define  ew(t)  £  R3  using  the  axis-angle  representation  [19] 
as  follows 


ew  =  ix(j) 


(17) 


where  /z(f)  £  R3  represents  a  unit  rotation  axis,  and  cf>(t)  £  R 
denotes  the  rotation  angle  about  fi(t)  confined  to  the  region 
—7 r  <  cj>(t)  <  7 r,  and  defined  as  follows 

4,  =  c«->  (i  (M«)  -  fi)  ,  = 

(18) 

In  (18),  the  notation  t/r  (•)  denotes  the  trace  of  a  matrix.  After 
taking  the  time  derivative  of  (17),  the  following  expression 
for  rotational  kinematics  is  obtained  (see  Appendix  I  for 
details) 

:  LuRcoior  (19) 


where  the  Jacobian-like  term  Lu(t)  £  R3x3  is  given  by  the 
following  expression 


Lu  —  h  —  -^S  (fi)  + 


1  - 


nnc  (</)) 


smc * 


S  (/t)2  (20) 


2  ) 


and  sine  (</>)  = 


sin  (</)) 


.  From  (14)  and  (19),  the  kinematics 


of  relative  motion  between  the  camera  and  the  object  can 
thus  be  expressed  as  follows 


e  =  Jv  +  f 


(21) 


where  e(f)  =  [  ej  ej"  ]T  £  R6,  and  v(t)  = 
\  u>r  ]T  £  R6.  The  vector  f(t)  £  R6  and  the  matrix 

J(t)  £  R6x6  in  (21)  are  defined  as  follows 


J 

/ 


--z^-AelRco  AelRcoS(Sl ) 
m\z  tot 


03x3 


AeiS(xco')toc 


llz 
L/jjRc 


03 


(22) 

(23) 


where  O3X3  G  R3x3  denotes  a  3  x  3  zero  matrix,  and  O3  G 
R3  denotes  a  zero  vector. 

We  assume  that  a  single  geometric  length  s  1  G  R3  between 
two  feature  points  on  the  object  is  known.  This  allows  us  to 
compute  x*a  using  the  following  alternative  expression 

x*co  =  [diag{*n  -  72)  +  R ]  1  [diag( 71  -  72 )R*co  -  Rco\  «i 

(24) 

where 

71  =  [diag  (ml)]  1n*Tm*lXh ,  (25) 

72  =  [diag  (ml)]-1  —  (26) 

ol  1 

and  mi(t),m*,n*,Xh(t)  and  a\(t)  are  all  measurable  sig¬ 
nals.  Also,  since  m*  =  ~^(x*co  +  f?*QSi),  fh|2  and  fh\  can 
both  be  computed.  From  (4),  (5),  and  the  definition  for  x  h(t), 
we  can  compute  xco(t).  Also,  as  mentioned  previously,  it  is 
assumed  that  the  velocities  vcc(t),u>cc(t)  are  available  from 
sensors  on  the  camera.  Hence,  each  element  of  .J(t)  and  /(f) 
given  in  (22)  and  (23)  are  known. 

Remark  2:  By  exploiting  the  fact  that  /i(f)  is  a  unit  vector 
(i.e.,  || p.|| 2  =  1),  the  determinant  of  Lu  ( t )  can  be  derived  as 
follows  [16] 

detL“  =  — 1  {0\  '  (2?) 
sine2  I  —  J 

From  (27),  it  is  clear  that  Lu  ( t )  is  only  singular  for  multiples 
of  27t  (i.e.,  out  of  the  assumed  workspace).  It  can  also 
been  seen  that  det(Aeii?Co)  7^  0.  Hence,  the  matrix  J(<) 
is  invertible  [11]. 

IV.  Relative  Velocity  Estimation 

In  [20],  a  model-free  estimator  for  asymptotic  identifica¬ 
tion  of  a  velocity  signal  was  presented,  utilizing  only  the 
measured  position  signal  for  estimation.  Based  on  this  work, 
in  [4],  we  presented  a  detailed  analysis  of  the  application 
of  this  observer  for  estimation  of  the  velocity  of  a  moving 
object  in  the  field  of  view  of  a  fixed  camera.  Specifically, 
designating  e(f)  G  R6  as  the  estimate  for  the  kinematic  signal 
e(f),  the  observer  was  designed  as  follows 

e  =  f  (K  +  I6x6)e(T)dT  +  f  psgn  (e(r))  dr(28) 

J to  J to 

+(K  +  I6x6)e(t) 

where  e(f)  ==  e(f)  —  e(f)  G  R6  is  the  estimation  error  signal, 
K,p  G  R6x6are  positive  definite  constant  diagonal  gain 
matrices,  Iq  G  R6x6  is  the  6x6  identity  matrix,  to  is  the 
initial  time,  and  sgn(e(f))  denotes  the  standard  signum  func¬ 
tion  applied  to  each  element  of  the  vector  e(f).  The  above 
estimator  is  guaranteed  to  asymptotically  identify  the  signal 
e(t)  provided  e(f),e(f),  e  (t)  G  and  the  gain  matrix 
p  satisfies  the  inequality  p,  >  |  |  +  |  e » |  ,  V*  =  1,2,. ..6.  It 

is  assumed  that  the  relative  velocity,  acceleration  and  jerk 
between  the  moving  object  and  the  camera  are  bounded, 
i.e.,  v(t),v(t),v(t)  G  £oo.  Given  these  assumptions,  the 
structure  of  (22)  and  (23)  allows  us  to  show  that  the  bounds 


on  e(t),  e(t)  and  e  (t)  are  satisfied.  Hence,  based  on  the 
analysis  in  [4],  e  (t)  — >  e(t)  as  t  — >  00.  Since  J(t)  is  known 
and  invertible,  the  six  degree-of-freedom  relative  velocity 
between  the  object  and  the  camera  can  be  identified  as 
follows 

v(t)  =  J~l(t)  (e  ( t )  -  /(f))  (29) 

and  v(t)  — >  v(t)  as  f  — >  00. 

Remark  3:  The  definition  of  relative  velocity  in  (15)  can 
be  expressed  in  the  following  equivalent  form 

Vr  —  RcoVcc  UJr  RcoiOcc  .  (30) 

Since  we  made  the  assumption  that  the  camera  velocity 
[  vjc  iJ^c  ]  is  known,  (28)  and  (30)  allows  us  to  recover 
an  estimate  for  the  object  velocity  [  v^a  ]  . 

V.  Euclidean  Structure  Estimation 

To  facilitate  the  development  of  an  estimator  for  Euclidean 
coordinates  of  the  feature  points  on  the  object  (i.e.,  the  vector 
Si  relative  to  the  object  frame  O,  rhi(t)  and  m*  relative  to 
the  camera  frame  C  for  all  i  feature  points  on  the  object), 
we  first  define  the  extended  image  coordinates  pei(t)  G  R3 
as 


pei  =  [  Hi  Vi  -ln(ai) 


(31) 


From  the  development  in  the  previous  section  for  transla¬ 
tional  kinematics,  the  following  expression  for  time  deriva¬ 
tive  of  the  (31)  can  be  obtained 

Pei  —  _  ..  -4 eiRCoVr  T  _  *  AeiS(xco)iVcc 

miz 

T  _  *  AeiRcoS(si)u)r 

miz 

=  Wuvvwei  +  W2i[ei\l  (32) 


where  Wu(-)  G  R3x3,W2l(-)  G  R 3,Vvw(t)  G 
9i  G  R4  are  as  follows 

d/i  —  OiiAeiRco 

FF2  i  —  exiAeiS(xco)uJcc 
Vvw  =  [  Vr  S(ur)  ] 


R3x4,  and 


(33) 

(34) 

(35) 

(36) 


In  (32),  the  notation  [0i]l  denotes  the  first  element  in  the 
vector  0i.  Note  that  in  (32  ),  we  have  linearly  parameterized 
the  time  derivative  of  the  extended  image  coordinates  in 
terms  of  known  or  measurable  quantities  Wu(-)  and  W/iO) 
and  the  unknowns  Vvw(t)  and  6/.  An  estimate  for  Vvw(t), 
denoted  by  Vvw  (f ) ,  is  available  from  re-arranging  the  vector 
v(t)  computed  from  (29).  Our  objective  in  this  section  is 
to  develop  an  estimator  for  the  unknown  constant  9i  in  (36) 
which  will  allow  us  to  compute  the  Euclidean  coordinates  of 
the  ith  feature  point  on  the  object.  To  facilitate  this  objective, 
the  parameter  estimation  error  (/(f)  is  defined  as  follows 


0i  =  9i  -  9i 


(37) 


where  9i(t)  £  R4  is  a  subsequently  designed  parameter 
update  signal.  Motivated  by  the  subsequent  stability  analysis, 
we  introduce  a  measurable  filter  signal  (i(t)  £  R3x4,  and  an 
un-measurable  filter  signal  Vi(t)  £  R3  defined  as  follows 

6  =  -PiCi  +  W3i  (38) 

Vi  =  -PiVi  +  WuVvw6i  (39) 

where  (3i  £  R  is  a  scalar  positive  gain,  and  Vvw(t)  = 
Vvw(t )  —  Vvw(t)  £  R3x4  is  an  estimation  error  signal,  and 

W3t  =  WuVvw  +  [  W2i  03x3  ]  .  (40) 

The  stability  analysis  presented  in  the  next  sub-section 
motivated  the  following  design  for  estimates  of  the  extended 
image  coordinates  pei(t),  denoted  by  pei(t)  £  R,  and  an 
adaptive  least-squares  update  law  [18]  for  the  Euclidean 
parameters  9i,  as  follows 

Pei  =  PlPei  +  CiOl+WliVvw9i  +  W2i[9l}1  (41) 

9i  =  Lit? pei  (42) 

where  pei  (t)  =  pei  ( t )  —  pei  {t )  £  R3  denotes  the  measurable 
estimation  error  signal.  L.pt)  £  R4x4  is  an  estimation  gain 
recursively  computed  as 

|(ir1)  =  cfc,  (43) 

and  initialized  such  that  Li_1(0)  >  0  to  ensure  that  it  is 
positive  definite  for  all  time  t  as  required  by  the  stability 
analysis.  From  (32)  and  (41),  the  time  derivative  of  the  esti¬ 
mation  error  in  extended  image  coordinates  can  be  computed 
as  follows 

pei=  -PiPei  -  Ci  +WliVvw9i  +  W3i9i.  (44) 

The  above  equation,  and  (39)  allows  us  to  develop  an 
alternate  expression  for  pei(t)  as  follows 

Pei  =  (A  +  Vi-  (45) 

A.  Stability  Analysis 

Theorem  1:  The  update  law  defined  in  (42)  ensures  that 
9i(t)  — >  0  as  t  — +  oo  provided  that  the  following  persistent 
excitation  condition  [18]  holds 

rto+T 

Tihxi<  /  (i  {f)C,i{T)dT  <  72/4x4  (46) 

Jto 

and  provided  that  the  gains  /3j  satisfy  the  following  inequal¬ 
ities 

Pi  >  ku  +  kvWWuWl  (47) 

ku  >  2  (48) 

where  to,  71, 72,  T,  ku,  k2i  £  R  are  positive  constants, 
1.4x4  £  R4x4is  the  4  x  4  identity  matrix,  and  the  notation 
ll-ll^  denotes  the  induced  oo-norm  of  a  matrix  [18]. 

Proof:  Let  V(t)  £  R  denote  a  non-negative  scalar  function 
defined  as  follows 

V  =  \oJ  L-^i  +  ^vIvi-  (49) 


After  taking  the  time  derivative  of  (49),  the  following  ex¬ 
pression  can  be  obtained 

V  =  ~\  0i  ~  Oj (i  Vi  ~  Pi\\Vi\\2 

+vf  WuVvw9i 
<  -\  C i9i  -Pi\\Vif 

+  11^1111^1100  V™,  \\Vi\\ 

00 

+  C %0i  \\Vi\\  -  kli  \\vi\\2  +  ku  \\vi\\2 

+fc2i|I^H|lLll^l|2-^|l^1i||2ooll%H2  (50) 

where  ku,k2i  £  R  are  positive  constants  as  previously 
mentioned.  Further  simplification  of  (50)  after  utilizing  the 
non-linear  damping  argument  [13]  results  in  the  following 
expression 


17  s  -6-£)  +2 

-  (Pi  -  ku  -  k2i  \\WU 

\t)  HU2 

+  7 —  |  \9i\\  Vvw 

K2i  00 

(51) 

The  gains  Pi,ku  and  k2i  are  selected  to 

enstue  that 

1  1 

2  ku  ~ 

[ft 

V 

0 

(52) 

Pi  -  ku  -  k2i\\Wu\\2oo  ^ 

to 

V 

0 

(53) 

where  pu,  p2 *  £  R  are  positive  constants.  The  above 
gain  conditions  allow  us  to  further  upper-bound  the  time 
derivative  of  (49)  in  the  following  manner 

V  <  -pu  (A  -  P2i\\Vi\\2  +  T-Pif  Vvw  -(54) 

K2i  00 

In  the  analysis  provided  in  the  development  of  the  kinematic 

estimator  in  [4],  it  was  shown  that  a  filter  signal  r(t)  £ 

R6  defined  as  r(t)  =  e(t)+  e  (t)  €  £00  n£2-  From  this 

result  it  is  easy  to  show  that  the  signals  e(t),  e  ( t )  £  C ^ 

n£2  [6].  Since  J(t)  £  Loo  and  invertible,  it  follows  that 

v(t)  =  e  (t)  £  Coo  D  C2.  Hence  it  follows  that 

2 

Vvw(t)  £  Ci  and 

OO 

f  j—  \\9i(T)\\2  Vvw (r )  dr  <  s  (55) 

Jo  tt2  i  00 

where  £  £  R  is  a  positive  constant.  From  (49),  the  integral 
of  (54),  and  (55),  it  can  be  concluded  that 

(tu  C i(r)9i(T)  +/X2i||r/i(r)|| 2 j  dr 

<  H(0)  -  V(oo)  +  £.  (56) 

Hence,  (i(t)9i(t),Vi(t)  £  C 2. Also,  from  (56)  and 

the  fact  that  V(t)  is  non-negative,  it  can  be  concluded 
that  V(t)  <  H(0)  +  £  £  Coo-  Therefore,  from  (49), 
9f  ( t)L£l{t)9i{t),Vi{t )  £  Coo-  Based  on  the  assumption  that 
the  persistent  excitation  condition  in  (46)  is  satisfied,  and 
L“1(0)  is  chosen  to  be  positive  definite,  we  can  use  (43)  to 


show  that  LJx{t)  is  always  positive  definite;  hence,  it  must 
follow  that  9i(t)  £  Coq.  Since  W:a(-)  in  (40)  is  composed  of 
bounded  terms,  and  is  the  input  to  the  stable  filter  in  (38),  it 
can  be  shown  that  C i(9>Ci(9  £  £oo  [6],  and  consequently, 
£  £oo-It  follows  from  (45)  that  pei{t)  £  C^, 

and  hence,  from  (42),  9,  ( t),9i  (t)  £  Based  on  these 
arguments,  it  is  easy  to  see  that  A 

Therefore,  Q(t')9l(t)  is  uniformly  continuous  [7],  and  since 
we  also  have  that  C i{t)6i(t)  £  C ^  (T  £2,  we  can  conclude 
the  following  result  from  Barbalat’s  Lemma  [7] 

C i(t)9i(t)  — »  0  as  t  — >  00.  (57) 

If  the  signal  (i(t)  satisfies  the  persistent  excitation  condition 
given  in  (46),  then  it  can  be  concluded  [18]  from  (57)  that 

9i(t)  — *  0  as  t  — >  00.  (58) 

□ 

Remark  4:  From  (7),  (9),  (36),  (42),  and  Theorem  1,  the 
constant  Euclidean  position  of  all  feature  points  on  the  object 
at  the  reference  position,  and  the  time  varying  Euclidean 
position  of  the  feature  points  on  the  moving  object  relative 
to  the  camera  can  be  computed  as  follows 

m*{t)  =  r„  1  A~lp*  (59) 

m(t)  =  - - r^-A~1pi(t).  (60) 

9i(t) 

VI.  Conclusions 

This  work  developed  a  technique  for  the  recovery  of  struc¬ 
ture  and  motion  for  the  general  case  where  both  the  object  of 
interest  as  well  as  the  camera  are  in  motion.  Homography- 
based  techniques  and  adaptive  estimation  theory  provided  the 
basis  for  the  estimator  design.  Future  work  will  include  the 
validation  of  the  proposed  estimator  through  simulations  and 
experiments.  The  applicability  of  this  work  for  tasks  such 
as  aerial  surveillance  of  moving  targets  is  well  motivated 
since  no  explicit  model  describing  the  object  or  its  motion 
is  required  by  the  estimator. 
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Appendix  I 

Development  of  Translation  and  Rotation 
Kinematics 

To  facilitate  the  development  of  the  kinematic  equations, 

the  following  properties  of  rotational  matrices  and  skew 
symmetric  matrices  are  utilized  [19] 


Rc 

—  RcSM 

(61) 

Rc 

0$ 

o 

y 

to 

II 

(62) 

Ro 

O 

0 

y 

toT 

II 

(63) 

S(u)  C 

II 

i 

30 

£ 

(64) 

S{Ru) 

=  RS(u)Rt 

(65) 

where  R(t)  £  SO(3)  is  a  rotation  matrix,  u(t),^(t)  £  1R3 
are  some  arbitrary  vectors,  and  the  rest  of  the  terms  were 
defined  previously. 

To  develop  the  translational  kinematics,  the  time  derivative 
of  (13)  is  obtained  as  follows 

(X  i  ^ 

ev  =  -rr^Ae  1  TOi  (66) 


From  Figure  1,  the  position  and  orientation  of  the  object 
relative  to  the  camera  are  as  follows 


xco  =  Rc  Oo  -  xc)  (67) 

Rco  =  RcRo ■  (68) 

After  taking  the  time  derivative  of  the  above  equations,  and 
utilizing  the  properties  in  (61)  to  (65),  we  have 

xco  —  S(xco)uicc  ~\~  Rc  (tt0  nc)  (69) 

Rco  =  RtcS{u0)R0-RtcS{loc)R0.  (70) 

Utilizing  (69)  and  (70),  the  expression  for  the  time  derivative 
of  (1)  for  i  =  1  is  given  by  the  following  expression 

m  1=  S(xco)i0cc  -  Rcovr  +  RcoS(si)ur.  (71) 


After  substituting  (71)  in  (66),  the  translational  kinematics 
in  (14)  can  be  obtained. 

The  development  of  rotation  kinematics  follows  the  de¬ 
scription  given  in  [16],  to  which  the  reader  is  referred  to 
for  details.  Only  a  brief  description  is  given  here.  We  define 
Lj(t)  £  R3  to  be  the  relative  rotational  velocity  between  the 
frames  O*  and  O  as  observed  by  the  camera.  Then  it  can 
be  shown  that  [19] 

R=  S(Q)R.  (72) 

The  open-loop  error  system  for  eu  (t)  is  derived  based  on 
the  following  exponential  parameterization  [19] 

R  =  hx3  +  S(n)  sin(»  +  2sin2(^)S'(^)2  (73) 

where  is  the  3x3  identity  matrix.  Using  (73)  and  its 
time  derivative  in  (72),  it  can  be  shown  that 

S(u>)  =  sin (<j))S(p,)  +  S(n)<fi  +  (1  —  cos  (</)))  S(S(n)fi). 

(74) 

To  facilitate  further  development,  the  time  derivative  of  (17) 
is  determined  as  follows 

=  ii(t>  +  (75) 

By  multiplying  (75)  by  (/3X3  +  5'(/t)2),  the  following  ex¬ 
pression  can  be  obtained 

(hx3  +  ^(f)2)  =  n<j>  (76) 

where  we  utilized  the  following  properties 

/rT/z  =  1,  /iT/i  =  0  (77) 

S(fi)2  =  WT~hx  3  (78) 

Likewise,  by  multiplying  (75)  by  —S(fi)2  and  then  utilizing 

(77),  the  following  expression  is  obtained 

-S(^)2eu  =  (79) 

From  the  expression  in  (74),  the  properties  given  in  (64), 
(75),  (76),  (79),  and  the  fact  that 

sin2(</>)  =  ^  (1  -  cos(20)) , 

we  cab  obtain  the  following  expression 


t  (sec) 


Fig.  2.  Estimation  error  in  the  coordinates  of  a  feature  point  on  the  object 
(fixed  camera,  1%  pixel  noise  added). 


where  Lu  ( t )  is  defined  in  (20).  From  (72),  (4),  (68),  and  the 
properties  in  (62)  and  (63),  it  can  be  shown  that 

u>  =  — Rcou>r  (81) 

where  u>r(t)  £  R3  was  defined  in  (15).  After  multiplying 
both  sides  of  (80)  by  Lu(t),  and  utilizing  (81),  the  rotational 
kinematics  given  in  (19)  can  be  obtained. 

Appendix  II 

Special  Cases:  A  Fixed  Camera,  or  a  Fixed  Object 

In  the  previous  sections,  we  presented  the  development 
for  the  general  case  where  both  the  camera  and  the  object 
are  in  motion  relative  to  an  inertial  frame.  As  described  in 
Sections  IV  and  V,  such  a  treatment  imposed  a  few  restrictive 
assumptions  on  the  system,  chief  among  them  being  that 
the  persistent  excitation  condition  of  (46)  was  satisfied 
at  all  times,  the  camera  velocity  [  vjc(t)  (x^c(t)  ]  was 
measurable,  a  single  position  vector  si  was  known,  and  the 
orientation  of  the  object  at  a  reference  position  relative  to 
the  camera  (i?*D)  was  known.  The  fact  that  Si  and  R*0 
was  available  enabled  us  to  compute  the  depth  information 
TO*,  and  the  position  vector  xco(t),  apart  from  the  rotation 
matrix  Rco(t).  As  shown  below,  for  the  special  cases  where 
either  the  object  or  the  camera  are  stationary,  some  of  these 
requirements  can  be  relaxed,  increasing  their  potential  for 
application  in  a  wider  range  of  real-world  scenarios.  These 
cases  are  derived  in  detail  in  [3],  and  here  we  present  a 
summary. 

1 )  Fixed  Camera:  When  the  camera  is  fixed  relative  to 
the  inertial  frame,  the  kinematics  expression  in  (21)  is  given 
by 

e  —  J fc^oo  (82) 

where  Jfc(t)  €  R6x6  is  the  following 

t  '  -AelRco  ~^-AelRcoS(Sl) 

Jfc  =  <z  <z 

_  ^3x3 


U  = 


(80) 


(83) 


The  time  derivative  of  extended  image  coordinates  in  (32) 
reduces  to  the  expression  given  below 

Pei  =  WuVvw9i  (84) 

where  Wu(-)  =  aiAeiRco  G  K3x3  and  Vvw  = 

[  v00  S(lu00)  ]  G  R3x4.  All  assumptions  and  requirements 

are  the  same  as  in  the  general  case,  except  that  the  estimation 
of  the  object  velocity  and  structure  relative  to  the  fixed 
camera  requires  no  additional  sensors  on  the  camera  (since 
ojcc(t)  =  0),  and  the  computation  of  the  position  vector 
Xco(t)  is  not  required.  As  an  example.  Figure  2  shows  the 
errors  in  the  estimation  of  the  Euclidean  coordinates  of  one 
of  the  features  points,  in  the  simulation  of  a  planar  object 
moving  in  the  field  of  view  of  a  fixed  camera. 

2)  Moving  Camera:  With  the  object  stationary  relative  to 
the  inertial  frame,  the  kinematics  of  the  moving  camera  can 
be  expressed  as  follows 


^  —  JmcVc 


(85) 


where  Jmc(t)  G 


5x6 


is  the  following  Jacobian-like  matrix 


Jmc  — 


Aels(m1) 


03 


x3 


-Lu 


(86) 


defined  as  follows 


R3x3,W2i(-)  G  M3x3, 

and  Oi  G  R  are 

IFi  j  =  a-iAei 

(89) 

W2i  =  AeiS{mi) 

(90) 

@i  =  TT5T- 

(91) 

miz 

A  measurable  filter  signal  Q(t)  G  R3,  similar  to  (38),  is  now 
defined  in  the  following  manner 


C i  —  fiiCi  A  W\ iVc 


(92) 


where  vcc(t)  G  M3  is  the  estimate  for  translational  velocity 
of  the  camera.  In  [3],  following  an  approach  similar  to  that 
presented  in  Section  V,  it  is  shown  that  the  depth  parameters 
9i  can  be  estimated  using  the  adaptive  least-squares  update 
law  of  (42),  and  subsequently,  the  Euclidean  position  of  all 
feature  points  on  the  object  relative  to  the  camera  can  be 
estimated.  Again,  the  need  for  additional  on-board  sensing 
is  eliminated.  Also  note  that  as  a  consequence  of  the  way 
the  filter  signal  Q(t)  in  (92)  is  formulated,  the  persistent 
excitation  of  (46)  is  also  simplified,  as  it  now  depends  only 
on  the  translational  velocity  signal. 


In  the  above  expression,  notice  that  all  terms  are  either 
known  apriori,  or  directly  measurable,  except  the  constant 
depth  Kz-  If  the  camera  can  be  moved  away  from  its 
reference  position  by  a  known  translation  vector  x  k  G  R3’, 
then  fh*z  can  be  computed  offline  without  the  knowledge  of 
R*a  and  si.  Decomposition  of  the  Euclidean  homography 
between  the  normalized  Euclidean  coordinates  of  the  feature 
points  obtained  at  the  reference  position,  and  at  Xk  away 
from  the  reference  position,  respectively,  can  yield  the  scaled 

translation  vector  G  R3.  Then,  it  can  be  seen  that1 

d* 


d*  _  d* 
n*Tm\  n*TA~1pl ' 


(87) 


Hence,  from  the  above  discussion,  we  note  that  it  is  possible 
to  develop  a  velocity  estimator  for  the  moving  camera  case 
without  having  to  know  the  reference  orientation  matrix  R  *a 
and  any  geometric  length  si  on  the  object.  However,  as  a 
consequence  of  the  specific  approach  we  have  employed 
in  the  development  of  the  Euclidean  estimator  in  Section 
V,  we  are  unable  to  benefit  from  this  relaxation  in  the 
assumptions  for  velocity  estimation.  In  [3],  we  formulate 
the  Euclidean  estimation  in  a  slightly  different  manner  that 
allows  us  to  eliminate  the  requirement  of  R*co  and  si;  i.e., 
the  time  derivative  of  the  extended  pixel  coordinates  in  (32) 
is  expressed  as 


Pei  —  i iVcc0i  T  W2 i&cc  (88) 


'Note  that  for  any  feature  point  Oi  coplanar  with  7r*,  m*z  could  be 
computed  this  way. 


